
Beta-Variational Autoencoder Experiments#

Introduction#

We implement and experiment with the beta-Variational Autoencoder (beta-VAE) framework introduced by Higgins et al. (ICLR 2017).

TLDR: We tweak the vanilla Variational Autoencoder (VAE) by adding a coefficient to the Kullback-Leibler term of the ELBO loss function. This gives us more control over the latent representations, in particular how well distinct properties separate across the latent variable's dimensions.

For example, when learning a latent space representation of faces, we may want one dimension to represent hair length, another to represent skin color, and so forth. This allows finer control over the images we generate with the VAE.

The vanilla VAE's loss function can be derived by first noting that we want to maximize the log likelihood of the data under the latent-space-to-data-space mapping, which is parameterized by a neural network with nonlinear activations. Because of this nonlinearity, we cannot come up with a closed-form, tractable likelihood function. So we instead maximize a tractable lower bound on the likelihood via variational inference, in a spirit similar to the Expectation-Maximization algorithm.
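For reference, that lower bound is the evidence lower bound (ELBO). Writing the encoder as \(q_\phi(z|x)\), the decoder as \(p_\theta(x|z)\), and the prior over latents as \(p(z)\), it takes the standard form:

\(\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \;-\; D_{KL}\!\left(q_\phi(z|x) \,\|\, p(z)\right)\)

Negating the right-hand side yields the two terms of the loss function discussed next.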

Through a series of mathematical and numerical tricks to work around this intractability, we ultimately end up with an error function, the negative estimated lower bound on the likelihood, of the form:

\(Loss = A + B\)

Where,

  • A = the negative log likelihood of the data given the latent variable and the neural network parameters. In our case, this is equivalent to the mean squared error, due to our Gaussian assumption on the data.

  • B = the KL divergence between the posterior over the latent variables, given the observed data and neural network parameters, and the prior distribution.

Now, one limitation of the VAE is the lack of control over the learned representations and the entanglement of useful properties. The beta-VAE's solution is simple: add a coefficient \(\beta\) so that the loss becomes

\(Loss = A + \beta B\)

Where we can adjust the value of \(\beta\) to encourage disentanglement of the features.
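The two terms have a simple closed form under the Gaussian assumptions above. Here is a minimal NumPy sketch of the beta-weighted loss (not our actual training code, just the formula made concrete); `mu` and `log_var` are the encoder's outputs for a diagonal Gaussian posterior, and the prior is a standard normal:

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Negative ELBO: reconstruction MSE (term A) plus beta times the KL term (B)."""
    # A: mean squared error, the reconstruction term under a Gaussian likelihood
    recon = np.mean((x - x_hat) ** 2)
    # B: closed-form KL between the diagonal Gaussian posterior N(mu, sigma^2)
    #    and the standard normal prior N(0, I), averaged over the batch
    kl = 0.5 * np.mean(np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=1))
    return recon + beta * kl
```

Setting `beta=1.0` recovers the vanilla VAE loss; raising it penalizes posteriors that stray from the prior more heavily.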

In this notebook, we experiment with the value of \(\beta\) and check how it affects the representations learned by our VAE, trained to reconstruct the CelebA dataset.

We preprocessed the data into 64x64 RGB images with pixel values in [0, 1]. We then normalized them to zero mean and unit standard deviation per pixel.
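A minimal sketch of this preprocessing step (the function name and the batch-wise per-pixel statistics are illustrative assumptions, not our exact pipeline):

```python
import numpy as np

def preprocess(images):
    """Scale uint8 images to [0, 1], then standardize each pixel across the batch."""
    x = images.astype(np.float32) / 255.0        # pixel values in [0, 1]
    mean = x.mean(axis=0, keepdims=True)         # per-pixel mean over the batch
    std = x.std(axis=0, keepdims=True) + 1e-8    # per-pixel std (avoid division by zero)
    return (x - mean) / std
```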

Our VAE architecture follows the architecture described in the beta-VAE paper by Higgins et al. However, instead of 32 dimensions for the latent variable, we used only 6.
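A PyTorch sketch of such a convolutional VAE with a 6-dimensional latent. The exact channel counts and layer sizes here are illustrative assumptions loosely following the paper, not a verbatim copy of our model:

```python
import torch
import torch.nn as nn

LATENT_DIM = 6  # the paper uses 32; we use 6

class ConvVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 3x64x64 image -> flattened conv features -> (mu, log_var)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32
            nn.Conv2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 8x8
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 4x4
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 4 * 4, LATENT_DIM)
        self.fc_log_var = nn.Linear(64 * 4 * 4, LATENT_DIM)
        # Decoder: latent vector -> 3x64x64 reconstruction
        self.fc_dec = nn.Linear(LATENT_DIM, 64 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 8x8
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.ConvTranspose2d(32, 32, 4, stride=2, padding=1), nn.ReLU(),  # -> 32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),              # -> 64x64
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        # Reparameterization trick: sample z in a differentiable way
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        x_hat = self.decoder(self.fc_dec(z).view(-1, 64, 4, 4))
        return x_hat, mu, log_var
```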

We visualize below a random sample of the preprocessed images for training.

../_images/53f0f9f910bb38b9ca52290f3b021a9a51d79bce7b94de143e98072a6a207d2f.png

We show in the animation below how a vanilla VAE (\(\beta = 1\)) learns to reconstruct the input image by first squeezing it into the latent posterior and then reconstructing from that latent variable back to data space. Ideally these images would match exactly, but we can't really expect that, given that we are compressing each 64 x 64 pixel image into a vector of just 6 numbers!


We show in the animation below how changes in each of the 6 dimensions of our latent variable affect the output image. Each row is a latent dimension, and each column is a unique sample. At each step in the animation we move one latent dimension from -1 to 1 while keeping the rest constant.

It's quite difficult to discern what each dimension does. The 2nd and 6th dimensions (rows) seem to increase or decrease the amount of hair.
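The traversal procedure itself is simple. A minimal sketch, where `decode` stands in for a trained decoder (latent vector in, image out) rather than our actual model:

```python
import numpy as np

def latent_traversal(decode, z_base, dim, steps=7, lo=-1.0, hi=1.0):
    """Sweep one latent dimension from lo to hi, holding the others fixed."""
    frames = []
    for value in np.linspace(lo, hi, steps):
        z = z_base.copy()
        z[dim] = value            # overwrite only the traversed dimension
        frames.append(decode(z))  # decode: latent vector -> image
    return frames
```

Running this once per dimension, on the same base vector, produces the grid of rows shown in the animation.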


Below, we changed \(\beta\) to 0.1, which makes the effect of the KL divergence very low. In this case, the reconstructions look better than those of the vanilla VAE above. Remember that here the model mostly just ends up minimizing the mean squared error between input and output.


We show below how the dimensions of the latent variable affect the output. Note that, as above, these are generated by randomly sampling a 6-element vector from the Gaussian prior over the latent variable, then passing it through the decoder network to reconstruct an image.

It seems some latent variables have something to do with lighting or hair.


Below, we increased Beta to 2. Again, we show the reconstruction quality over training iterations. They look good as well.


I am not sure if it’s just my bias, but it seems to me the latent dimensions have more distinct effects now. For example, the 2nd latent dimension (2nd row) looks like it’s related to lighting angle, the 4th row is related to darkness and lightness, the 5th row is related to gender, and the 6th row is related to hair length.


Lastly, we tried \(\beta = 10\).


In this case, it seems the latent dimensions change the lighting mostly.


Some thoughts#

This has been a fun exercise. I first studied the theory of VAEs, then implemented it, with some tweaks I did not anticipate. For example, when coding the ELBO loss, I didn't anticipate the need to turn the reconstruction (likelihood) term into MSE. But knowing that MSE is what is minimized when the log likelihood of a Gaussian is maximized, I went with MSE.

Analyzing the disentangled results visually is not easy. The original authors of beta-VAE derived a disentanglement metric to quantify this. It seems that raising \(\beta\) is still hit or miss.